System Providing Expressive and Emotive Text-to-Speech

ABSTRACT

A text-to-speech system includes a text and labels module receiving a text input and providing a text analysis and a label with a phonetic description of the text. A label buffer receives the label from the text and labels module. A parameter generation module accesses the label from the label buffer and generates a speech generation parameter. A parameter buffer receives the parameter from the parameter generation module. An audio generation module receives the text input, the label, and/or the parameter and generates a plurality of audio samples. A scheduler monitors and schedules the text and label module, the parameter generation module, and/or the audio generation module. The parameter generation module is further configured to initialize a voice identifier with a Voice Style Sheet (VSS) parameter, receive an input indicating a modification to the VSS parameter, and modify the VSS parameter according to the modification.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of, and claims priority to, co-pending U.S. patent application Ser. No. 16/495,422, which was filed on Sep. 19, 2019, entitled “System Providing Expressive and Emotive Text-to-Speech,” which was a national stage entry of PCT Application number PCT/US18/24033, filed Mar. 23, 2018, entitled “System Providing Expressive and Emotive Text-to-Speech,” and which claimed the benefit of U.S. Provisional Patent Application Ser. No. 62/475,296, filed Mar. 23, 2017, entitled “System Providing Expressive and Emotive Text-to-Speech.” The disclosures of the prior applications are hereby incorporated herein in their entireties.

FIELD OF THE INVENTION

The present invention relates to sound generation, and more particularly, is related to producing expressive speech from text.

BACKGROUND OF THE INVENTION

Various systems have been used to generate a synthesized audio voice rendering performance of a text string, for example, a sentence or phrase stored in a computer text file. The techniques used in these systems have been generally based upon statistical parametric speech synthesis (SPSS), typically using Hidden Markov Models (HMM), Deep Neural Networks (DNN), and/or Artificial Neural Networks (ANN).

FIG. 1 is a schematic diagram showing a prior art SPSS system 100. Broadly, the SPSS system 100 may be broken down into two major components: a training part 101 that creates and maintains a library of acoustic speech features 150, and a synthesis part 102 that applies this library to a text input to produce a synthesized speech waveform.

Typical Statistical Parametric Speech Synthesis Systems may use HMMs, DNNs, and/or ANNs for the training and synthesis parts respectively. The Speech Synthesis part may include but is not limited to the following modules: Conversion of text to phonetic descriptions module 110, Parameter Generation Algorithm module 120 (HMMs, DNNs, ANNs), Synthesis module 130 (HMMs, DNNs, ANNs), Model Interpolation module (not shown) (HMMs, DNNs, ANNs), Short-Term Parameter Generation Algorithm module (not shown) (HMMs, DNNs, ANNs), and Vocoding module 140 (offline, real-time or streaming).

During synthesis, SPSS systems compute a vector C of static and dynamic voice features via maximum likelihood parameter generation (MLPG) by maximizing over all available phonetic contexts provided by the phonetic labels of the input text.

SPSS streaming synthesis, for example Mage/pHTS, may be used to modify speech at three levels: phonetic context, parameter generation, and at the vocoder level. Phonetic context controls what is being said, parameter generation controls the parameters of the voice model such as prosody, speaking style and emotion, and the vocoder level control manipulates individual frames while the synthetic speech is being generated. Therefore, with SPSS streaming synthesis, it is possible to modify the speech before and while it is being generated. This was not possible with early implementations of SPSS, where speech synthesis parameters were statically generated over the complete input sentence (input text). The introduction of streaming SPSS enabled speech synthesis parameters to be generated within a small sliding window that provides variable control of a movable portion of the complete input sentence as it is being rendered. There are a few examples of alternative approaches that apply text markup to specific ranges of an input sentence to indicate emphasis, or changes in speed. Some more recent schemes have added detailed markup to alter rendering at the phoneme level, but these schemes only allow for duration and pitch control. Therefore, there is a need in the industry to address one or more of these deficiencies.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide a system providing expressive and emotive text-to-speech. Briefly described, the present invention is directed to a text-to-speech system including a text and labels module that receives a text input and provides a text analysis and a label with a phonetic description of the text. A label buffer receives the label from the text and labels module. A parameter generation module accesses the label from the label buffer and generates a speech generation parameter. A parameter buffer receives the parameter from the parameter generation module. An audio generation module receives the text input, the label, and/or the parameter and generates a plurality of audio samples. A scheduler monitors and schedules the text and label module, the parameter generation module, and/or the audio generation module. The parameter generation module is further configured to initialize a voice identifier with a Voice Style Sheet (VSS) parameter, receive an input indicating a modification to the VSS parameter, and modify the VSS parameter according to the modification. Other systems, methods and features of the present invention will be or become apparent to one having ordinary skill in the art upon examining the following drawings and detailed description.

It is intended that all such additional systems, methods, and features be included in this description, be within the scope of the present invention and protected by the accompanying claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic diagram showing a prior art SPSS system.

FIG. 2 is a schematic diagram showing an embodiment of an SPSS system with a control interface.

FIG. 3 is a schematic diagram showing an embodiment of a streaming SPSS system incorporating a scheduler.

FIG. 4 is a graph showing envelope nodes and modulation destinations.

FIG. 5 is a schematic diagram illustrating an example of a system for executing functionality of the present invention.

FIG. 6 is a flowchart showing an exemplary embodiment of a method for executing functionality of the present invention.

FIG. 7 shows an example of a prior art Cascading Style Sheet (CSS) string.

FIG. 8 shows an example of a VSS string.

FIG. 9 shows an example of the speech parameter pitch.

FIG. 10A is a graph showing the phonemes present in a test sentence.

FIG. 10B is a graph that shows the output of the graphical interface module when VSS pitch controls are applied to the text sentence of FIG. 10A.

FIG. 10C is a graph showing the output of the graphical interface module when VSS duration controls are applied to the text sentence of FIG. 10A.

FIG. 10D shows the output of the graphical interface module when VSS controls for both duration and pitch are applied to the text sentence of FIG. 10A.

DETAILED DESCRIPTION

As used within this disclosure, “prosody” refers to an indicator of stress, meaning, emphasis, emotion, contrast, and/or focus in a spoken audio language phrase, for example using rhythm, intonation, inflection, intensity, duration, amplitude modulation, stressed sibilance, and other voice characteristics.

As used within this disclosure, a “rendering” refers to a text string and a plurality of voice parameters and/or features configured to be converted to an audio waveform, for example, via a plurality of audio samples. The conversion to audio may be performed, for example, by a voice synthesizer configured to receive the rendering and produce the audio samples and/or an audio waveform.

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts. Existing SPSS systems described in the background section do not provide any facility for authoring graceful overlapping animation of multiple low-level voice parameters concurrently, or the manipulation and animation of anything other than predefined static text.

Embodiments of the present invention of a text-to-speech include a device and system providing a statistical parametric voice synthesizer that enables independent control of several discrete elements of voice synthesis of waveforms that emulate human speech in real-time. The embodiments relate to a synthesizer that provides a statistical parametric text to speech engine that is capable of responding to real-time commands that control pitch, speed, vocal tract length, duration, speaking style, and other parameters. Further embodiments include a method for authoring and displaying “animation” control data specifically tailored to manipulate a suitably responsive real-time speech synthesizer.

The embodiments represent the following improvements over previous SPSS implementations:

-   FIG. 2 shows an SPSS system 200 with an added control interface 210 for manipulating parameters in a text-to-speech (TTS) system,
-   Conversion of streaming text to labels, and
-   FIG. 3 shows a scheduler 380 for streaming SPSS.

FIG. 2 shows an SPSS embodiment. A text string 220 undergoes text analysis 230, resulting in N phonetic labels lexically and phonetically describing the text string 240. This is used to access context dependent models for acoustic features and duration 250. An MLPG module 260 provides parameter generation from the models for all of the labels, translating the controls to VSS including, for example, pitch (f0), spectrum, duration, vocal tract length, and aperiodicity per frame for all the labels 270. The control interface 210 provides for real-time manipulation of these parameters. A vocoder 280 synthesizes a set of audio samples, which may be processed, for example, via a digital to analog converter to produce a synthesized speech waveform 290 that may be rendered by a rendering system such as an audio amplifier and an audio transducer (not shown).

The control interface 210 may include a graphical display, for example, a touch screen or another display that allows for manipulation of graphical and/or text objects using an input device, for example, a mouse, touchpad, keyboard, or track ball, among others. For example, the touch screen may detect a single touch gesture and/or multi-touch gesture and convert the gesture into a command to control a speech parameter.

In one embodiment of a text-to-speech (TTS) system, appropriately formatted VSS instructions are authored by a human animator. These VSS performance descriptions are ingested, parsed, and converted into an instruction set that manipulates the various parameters of a specialized TTS playback mechanism, enabling it to respond expressively to proprietary control data that controls the rendering by an audio transducer in real-time as “animated TTS,” giving the system the ability to “act” in ways that are far more subtle and detailed than traditional TTS systems, which rely on algorithmically generated/simulated inflections that are not authored by a human animator, and/or are not algorithmically predicted and generated speech samples. This embodiment is analogous to an animated wireframe maquette, or preliminary model or sketch, used in visually animated computer graphic storytelling for film and television. This system controls what can be thought of as a “voice maquette” or voice model that is a sonically malleable speech synthesizer controlled via specialized animation controls which are pre-authored by a human “animator” to deliver a performance of speech that enacts the dialog or scene in a suitable way to convey the emotional or semantic content at hand.

An example of a first aspect of the embodiment may be a commercially available piece of hardware capable of interpreting and rendering animated speech. An example of a second aspect of the embodiment may be a tool, for example an internal tool or an external application, to help developers of speech systems author animated speech for rendering on the hardware of the first aspect. For example, the internal tool may be a framework/application used to craft the performance parameters of the voices, while an external application may be a variation of the internal tool with a simplified GUI that allows the user to personalize certain characteristics of the voice of their device.

The several features of the TTS renderer may be controlled independently to deliver complex vocal performances. These controls may affect audible parameters such as the pitch, tempo, duration, and timbre of the voice. Each of these parameters may be addressable on a sub-phoneme level explicitly bound to a particular phrase or piece of dialog. Additionally, this speech synthesizer can bind envelope controls to a range of appropriately tagged dynamic content, or higher-order elements of speech such as grammatical structures. In addition to a library of known phrases that have been prepared with human authored animations or performances, the system under the first embodiment may encounter a phrase that has not been rendered before but contains a grammatical feature the system has previously stored as an animation.

An example of such a previously stored animation is a greeting. A host application may have an affordance whereby the user can teach the system their name. When the system hosting the host application greets the user by name, it may have a series of pre-animated contours to play back for greetings of various syllable counts. For example, “Hi Dave” vs “Hi Sebastian”. Even though no explicit animation exists for the name Sebastian, the system may map animations across phrase elements that are of unpredictable length but can be recognized as belonging to a class of utterances that the system may encounter. In another example, an animation could be authored to handle “contrastive” sentences: “I can't do X, but what I CAN DO is Y.” Here again, the system could have animations that are bound to structural elements and not simply tied to pre-scripted text strings.

Some previous systems use a text markup scheme similar to HTML to impose “expressivity” onto computer generated speech, where words or parts of a phrase may be surrounded with tags that tell the TTS engine to adjust the rendering of speech in some particular way, for example, raising the pitch or adjusting the volume. But this markup is rarely at the phoneme level, and those systems which do allow this level of detail do not allow for independent control over amplitude, vocal tract length, duration, etc. Additionally, those systems that do offer some limited set of phoneme controls do not enable the injection of “wildcard” words or phrases into the animation control stream. An advantage of the present embodiment is that animation envelopes (see FIG. 9) may be applied to explicit text or more abstract features of language like certain types of grammatical construction or sentence graphs. Some control data may be bound to explicit text, while other data may only apply to more high-level abstract aspects of the content if they are present, such as parts of speech, grammatical structures, or even physical aspects of the content like ‘the last three words of the sentence.’

The present embodiment may be viewed as text to voice generation somewhat analogous to computer graphic animation. Computer graphic animation capability has grown in sophistication over time and now appears in a wide variety of popular entertainment forms. Feature-length movies solely populated by computer generated actors were unthinkable 30 years ago. Currently computer animation appears in some form in almost every film made today.

A given text sentence may be converted into a collection of frames that are provided to a vocoder to convert into speech/sound. The control interface 210 for the vocoder relies significantly on trajectories, which include duration, fundamental frequency and spectral coefficients. Depending on the vocoder chosen for analysis and synthesis, other feature trajectories and parameters may be present, such as aperiodicity and frequency warping. A parameter trajectory may be sequential, such that for the present control parameter of a present frame, for example, a frame at time t, the interface relies on the parameter trajectories generated for the previous frame, for example at time t−1. A simple trajectory, for example, a trajectory under 0.1 sec that would pass undetected by the user, or a trajectory depending upon, for example, 20 future frames of about 5 ms/frame, may only rely on the previous frame, while more complex trajectories may rely on the previous two or more frames, providing continuity to an utterance. The generation of a trajectory may also rely on future frames. Generating the parameter trajectories involves generating information for every frame, including, but not limited to: fundamental frequency, spectrum and aperiodicity. In particular, in order to describe each frame, one value may be used for each of a fundamental frequency (f0 value in Hertz for voiced frames/0 for unvoiced frames), a vector of spectrum coefficients, for example, 60 spectrum coefficients, and a vector of aperiodicity coefficients, for example 5 aperiodicity coefficients, as described further below. For a specific vocoder, in order to define a frame, one fundamental frequency value, X spectral coefficients and Y aperiodicity coefficients may be needed, but values X and Y may differ depending on the sampling frequency of the data. The higher the sampling frequency, the higher the number of coefficients. In the case of another vocoder, for example the MLSA vocoder, use of one fundamental frequency value, 35 cepstral coefficients and no aperiodicity coefficients may suffice.
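
As a concrete illustration of the per-frame description above, the following minimal Python sketch models one frame of vocoder parameters. The class name, the coefficient counts (60 spectrum, 5 aperiodicity), and the convention of f0 = 0 for unvoiced frames follow the example values given here; everything else is a hypothetical layout, not a prescribed implementation.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class FrameParams:
        """Hypothetical per-frame vocoder parameters (one frame of roughly 5 ms)."""
        f0: float                  # fundamental frequency in Hertz; 0.0 for unvoiced frames
        spectrum: List[float]      # e.g., 60 spectral coefficients
        aperiodicity: List[float]  # e.g., 5 aperiodicity coefficients

    # One voiced frame at 220 Hz with placeholder coefficient vectors.
    frame = FrameParams(f0=220.0, spectrum=[0.0] * 60, aperiodicity=[0.0] * 5)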

According to one embodiment, the system may generate 3 files, one for each trajectory (fundamental frequency, spectrum and aperiodicity). The fundamental frequency trajectory contains 1×N values where N is the number of frames predicted for the input text. The spectrum trajectory contains M×N values where M is the order of the coefficients used while N is the number of frames predicted for the input text, and the aperiodicity trajectory contains M×N values where M is the order of the aperiodicity coefficients while N is the number of frames predicted for the input text. Please note that the values of 60 and 5 for the spectrum and aperiodicity respectively may vary. For example, the values for the spectrum and aperiodicity may depend on the analysis window which, in turn, may depend on the sampling frequency of the data. If the training data are sampled at 16 kHz it may be desirable to use an FFT analysis window of 512 samples rather than one of 2048 samples that may be preferable for data sampled at 48 kHz. Then depending on the sampling rate of the data and the granularity the values may increase or decrease. For example, for the WORLD vocoder, default parameters for a sampling rate of 48 kHz are 60 and 5 for the spectrum and aperiodicity respectively.
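
The three trajectory files can be pictured as plain arrays with the dimensions just described: a 1×N fundamental frequency trajectory and M×N spectrum and aperiodicity trajectories. The sketch below, a simplified illustration that assumes NumPy and arbitrary file names, stacks per-frame values into those shapes and writes one file per trajectory.

    import numpy as np

    def write_trajectories(f0_values, spectrum_frames, aperiodicity_frames, prefix="utterance"):
        """Write one file per trajectory: f0 is 1xN, the other two are MxN."""
        f0 = np.asarray(f0_values)                        # shape (N,)
        spectrum = np.asarray(spectrum_frames).T          # shape (M_spec, N)
        aperiodicity = np.asarray(aperiodicity_frames).T  # shape (M_ap, N)
        np.savetxt(prefix + ".f0", f0)
        np.savetxt(prefix + ".spectrum", spectrum)
        np.savetxt(prefix + ".aperiodicity", aperiodicity)

    # Two toy frames, each with 60 spectrum and 5 aperiodicity coefficients.
    write_trajectories([220.0, 0.0], [[0.0] * 60] * 2, [[0.0] * 5] * 2)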

The control interface 210 provides a graphical user interface (GUI) to the user where the generated parameters may be represented as trajectories, for example, a GUI that reads these three parameter trajectory files and presents them graphically, for example, in two dimensional graphs. In an exemplary two dimensional graph shown by FIG. 9, the y-axis represents the value read and the x-axis indicates the frame count.

In addition to trajectories of the three parameters, the GUI may numerically and/or graphically represent parameters such as time elapsed and relative speed of delivery of the rendered utterance, as well as other parameters of the vocoder, such as vocal tract length. The GUI allows trajectories to be accessed and/or modified, for example, by changing the values on the x-axis and/or y-axis of the graphs, or by other means, for example, using pop-up menus, text boxes, or other graphical interface tools. The modifications by the user are then used to regenerate the parameter trajectories so that they reflect the intention of the user on the controller based on the modification.

The embodiments may also translate a Cascading Style Sheets (CSS) type structure into one or more trajectories. The present embodiment is analogous to a CSS and is called a Voice Style Sheet (VSS). VSS is applied to speech processing in order to create/apply stylistic controls over the generated speech parameters and therefore affect the final speech output. The controls present in a VSS file may be translated into frames, or any other unit where the controls may be applied, for example, a word, phrase, or sentence, and applied on existing trajectories. In general, the controls are applied to frames even if the control level is directed to a higher level abstraction; for example, controls applied to a whole phrase are translated into and implemented upon frames.

In a similar manner, the controls that are manually input by a user in the GUI may be translated for storage in a VSS file and saved for future use. Unlike previous text-to-voice systems, the control interface 210 for the present embodiment allows the user to:

-   Correct the pitch generated by the system,
-   Correct the duration of silences and pauses generated by the system,
-   Fine tune prosody for appropriate system responses,
-   Modify prosody in order to have questions, exclamations, etc., and
-   Modify the overall personality of the voice.

The vocal markup tool provides for graphical manipulation of parameters used for preparing an input text string for rendering by a speech synthesizer. In particular, the vocal markup tool adds symbols and text to the text string to provide rendering instructions to the speech synthesizer.

The markup symbols may generally indicate a value or range for one or more vocal parameters, such as pitch, duration, amplitude, vocal tract (e.g., size of voice box, length of the vocal tract, etc.), sibilance, prosody width (the amount of pitch inflection applied to speech), and silences (time gaps between audible utterances). The markup tool may also be used to probabilistically determine the occurrence or value of a parameter being utilized. This may be used to prevent repeated utterances from sounding identical. For example, the timing of a breath in a phrase, or the exact pitch used for a specific word or phrase, may be affected by a degree of randomness applied to a specific parameter. The user may specify a degree of randomness applied to a given speech parameter, for example, in one of two ways: (1) by specifying a high and low range for the parameter's value, or (2) by specifying the probability that the parameter adjustment will be applied during the current rendering. At rendering time, the VSS is evaluated, and any randomized parameters are rendered accordingly.
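
As a rough sketch of how the two randomization options above might be evaluated at rendering time, the following function resolves one randomized parameter; the dictionary keys and the function name are assumptions made for this illustration, not part of the described system.

    import random

    def evaluate_randomized_parameter(param):
        """Resolve one randomized VSS-style parameter at rendering time.

        param may carry either a low/high value range (option 1) or a
        probability that a fixed adjustment is applied at all (option 2).
        """
        if "low" in param and "high" in param:      # option 1: value range
            return random.uniform(param["low"], param["high"])
        if "probability" in param:                  # option 2: apply-or-not
            return param["value"] if random.random() < param["probability"] else None
        return param["value"]                       # no randomness requested

    # Example: a pitch amplitude drawn between 110% and 130% on each rendering.
    print(evaluate_randomized_parameter({"low": 110, "high": 130}))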

While the markup language uses text and/or symbols to indicate each of these parameters in relation to a textual word (or a portion of a textual word), the markup tool presents the parameters graphically so the user (voice animator) may visually interpret the parameter and manipulate it, for example, using a mouse or track pad.

For example, a pitch block may represent the pitch of the voice to be rendered via a graph of frequency (x) vs. time (y), such that the height of a line representing the pitch corresponds to a change in pitch (frequency). The pitch line may include one or more handles or markers, for example, a black dot on the pitch line, that may be manipulated to change the pitch. The user may insert additional handles on the pitch line to change the time granularity control of the pitch. Other tools may be used to manipulate the pitch, such as curve generators (to ensure smooth pitch transitions) or a granular step tool, to ensure that the pitch snaps according to specific allowed values.

Similarly, durations of a desired parameter may be controlled by size and placement of a graphical marker along the time (y) axis.

Various graphical tools may be assigned to a particular parameter destination. Such graphical tools may be configured to enhance a trajectory generated by the system (enhance mode), or a modulator may be configured to replace a trajectory generated by the system.

Destinations controlled by a graphical tool may include, but are not limited to, pitch, duration, amplitude, vocal tract, sibilance, prosody width, and silences.

An envelope is a graphical tool that modulates sound over a series of time segments. A typical envelope may have three time segments: attack, sustain, and decay. More complex envelopes may break up each of these time segments into two or more sub-segments. When a sound producing source (an oscillator) produces sound, the loudness and spectral content of the sound change over time in ways that vary from sound to sound. The attack and decay times of a sound have a great effect on the sonic character of that sound. Sound synthesis techniques often employ an envelope generator that controls a sound parameter at any point in its duration. Most often this envelope may be applied to overall amplitude control, filter frequency, etc. The envelope may be a discrete circuit or module or may be implemented in software.
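
For illustration, the sketch below generates a simple per-frame envelope in the spirit of the VSS envelopes described later (origin, width/attack/decay, amplitude, sustain). The linear ramp shape and the function signature are assumptions for this example only.

    def envelope(n_frames, origin, attack, sustain, decay, amplitude):
        """Return one scaling factor per frame (1.0 = no change).

        The peak value `amplitude` is reached at `origin`, held for `sustain`
        frames, with linear ramps of `attack` frames before and `decay` frames after.
        """
        env = [1.0] * n_frames
        for i in range(n_frames):
            if origin <= i < origin + sustain:                      # plateau
                env[i] = amplitude
            elif origin - attack <= i < origin:                     # rising ramp
                env[i] = 1.0 + (amplitude - 1.0) * (i - (origin - attack)) / attack
            elif origin + sustain <= i < origin + sustain + decay:  # falling ramp
                env[i] = amplitude + (1.0 - amplitude) * (i - (origin + sustain)) / decay
        return env

    # Scale frames 50-59 by 1.4, ramping in over 20 frames and out over 30.
    pitch_scale = envelope(n_frames=200, origin=50, attack=20, sustain=10, decay=30, amplitude=1.4)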

FIG. 3 depicts a scheduler module 380 that monitors and registers the activities of a text and label module 310, a parameter generation module 330, and an audio generation module 350. The scheduler module 380 is described in further detail below.

FIG. 4 is a plot diagram of envelope nodes and speech functions showing how four separate CSS-like files act on a single utterance. In this example, each envelope applies to a specific parameter of the speech to be generated. This yields overlapping animation of speech parameters acting on a single utterance.

The animation of a text stream may include one or more envelopes that modulate the value of one or more speech parameters. Envelopes are placed at specific locations within a phrase by means of an origin of the envelope (FIG. 9). The placement of the origin may be specified in one of several ways. For example, a specific phoneme may be used to anchor the origin of a VSS envelope, such that the occurrence of the identified phoneme results in the placement of the origin point of an envelope targeting a specific speech parameter. Similarly, a word may be used to set the origin of a modulator at a particular location in a sentence, for example at the third word of the sentence. FIG. 7 shows an example of a prior art CSS string that denotes the look and placement of text. In contrast, FIG. 8 shows a VSS string, which provides envelope control over various speech parameters used for rendering the expressive TTS.

Combining streaming SPSS with ANNs may provide the following benefits:

-   Leveraging ANN output quality & streaming controllability,
-   Starting the system response instantly & streaming it to the user while still rendering an output,
-   Altering the response to the user on the fly, for example, cutting the rendering short or adding further information after rendering speech has already begun,
-   Minimizing system latencies,
-   Reducing the computational system load by computing & optimizing smaller parameter sets, and
-   Changing the speaking style while the system is responding.

The scheduler provides precise control over the speech rendering parameters delivered to the vocoder. The scheduler assures certainty in the correct interpretation and application of the VSS on the parameters. Timestamped data provides information on where and when particular controls are appropriately applied. For example, the position (timing) of a simulated breath in the context of a spoken phrase may affect the delivery of subsequent portions of the text. Further, the scheduler makes it possible to regenerate data with a given control set or VSS.

Given a control set VSS but without the use of a scheduler, the controls may be applied on the phrase/word/frame, and eventually the samples, at random times in a multithreaded architecture. Therefore, for a given text phrase and a given VSS control set, every synthesis iteration may result in a slightly different rendering of speech samples. The difference in time may vary due to the workload of the generation threads, the protected/mutexed areas, as well as the overall processing load of the system/device running the synthesizer.

On the other hand, given a control set VSS and a scheduler, the controls may generally be applied every time to the same segment of speech in a deterministic fashion, and therefore for a given text phrase and a given VSS control set every synthesis iteration will result in exactly the same speech samples. In general, use of timestamps by the scheduler ensures that the VSS controls are applied to the phrase/word/frame at the exact time that this segment is being processed, without being affected by any processing load of the threads or the system. An example fragment of VSS is typically formatted as follows:

    .pitch_example{
      speech_parameter: pitch;
      origin: 2wd;
      width: 50fr 75fr;
      amplitude: 140%;
      sustain: 60fr;
    }

FIG. 9 shows an example of an envelope applied to the speech parameter pitch. In general, speech parameters may be modulated by specifying envelopes. VSS envelopes have an origin, which is the point that fixes the peak of the envelope for a given trajectory. The origin is the point from which the curve of the envelope is calculated. The origin of an envelope can be placed using any one of several reference markers within the phrase.

For a word, the origin may be placed on a stable part of the stressed phoneme of a given word. Reference to the word may be indexed by a word count; for example, the first word in the sentence may be written as “origin: 1wd;”. Note that the word indexing may also be applied backward from the end of the sentence; “origin: −2wd;” would put the origin on the stable part of the stressed phoneme of the second to last word in the phrase or sentence.

For a frame, the origin may be placed at an explicit frame within the phrase or sentence. This may be written, for example, as “origin: 110fr;”.

For a percentage, the origin may be placed at an arbitrary percentage of the way through the phrase or sentence. In this way, “origin: 25%;” would center the origin ¼ of the way through the entire sentence. Note that leading and trailing silences may not be included in the total percentage of the phrase, but pauses in the sentence may be included.

For a phoneme ID, the animator may target a specific phoneme in a sentence using the origin.

The pitch of any phoneme can be adjusted by setting the width using the origin statement. For example, using the origin “origin: 1aa;”, the first occurrence of phoneme “aa” would be targeted, while by using the origin “origin: −1aa;”, the last occurrence of phoneme “aa” in the test sentence would be targeted. A wildcard indicator, such as “origin: *aa;”, targets all occurrences of the phoneme “aa”.
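
The origin forms above (frame index, percentage, and phoneme ID with an optional wildcard or negative index) can be resolved to frame positions roughly as sketched below. The label structure, a list of (phoneme, start_frame, end_frame) tuples, and the function name are assumptions made for this illustration; word-count origins such as "1wd" are omitted for brevity.

    import re

    def resolve_origins(origin, labels, n_frames):
        """Resolve a VSS-style origin string to a list of frame indices.

        labels: list of (phoneme, start_frame, end_frame) tuples for the phrase.
        """
        if origin.endswith("fr"):                        # e.g. "110fr": explicit frame
            return [int(origin[:-2])]
        if origin.endswith("%"):                         # e.g. "25%": fraction of the phrase
            return [int(n_frames * float(origin[:-1]) / 100.0)]
        m = re.fullmatch(r"(-?\d+|\*)([a-z]+)", origin)  # e.g. "1aa", "-1aa", "*aa"
        if m:
            index, phon = m.groups()
            hits = [(s + e) // 2 for p, s, e in labels if p == phon]  # middle of each match
            if index == "*":
                return hits
            i = int(index)
            return [hits[i - 1]] if i > 0 else [hits[i]]
        raise ValueError("unsupported origin: " + origin)

    labels = [("h", 0, 10), ("ey", 10, 40), ("b", 40, 50), ("aa", 50, 80), ("b", 80, 90)]
    print(resolve_origins("1ey", labels, n_frames=200))  # -> [25]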

FIG. 10A shows the phonemes present in the test sentence “Hey Bob, how are you?” In order to alter the pitch of phoneme “ey” in the first word of the sentence (Hey), the statement “origin: 1ey;” may be used to center the origin in the middle of the stable state of the given phoneme.

The purpose of this control is apparent when applied to voiced parts of the speech, meaning the parts that have a non-zero pitch value. It may also be applied on unvoiced parts of the sentence, such as consonants, pauses and silences; however, there may be no audible result there. The phonemes used in examples herein are from the English language, represented in ASCII. However, in addition, foreign phonemes may also be used to better pronounce foreign words in a more natural and understandable way.

The controls may be applied to the phonetic level of text to achieve improved granularity of the sculpting. The control sets are expandable to refer to linguistic features as well, enabling the animator to target specific words in the text and/or punctuation like commas, exclamation marks, full stops, etc.

The width describes a duration or temporal effect of a particular curve, indicating, for example, the amount of time to reach full effect and how quickly it decays back to the original level. Both the attack and decay may share the same value; for example, if the width attribute is followed by a single value, then the attack and decay times are equal. If, on the other hand, two values are specified, the first value may indicate the attack time (duration) and the second value may indicate the decay time. The format may be presented as:

Valid Parameters—[word | frame | percentage], for example:

    width: 1wd;
    width: 35fr 10fr;
    width: 30%;
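
A small sketch of the one-value versus two-value width rule described above; the function name and the returned (attack, decay) pair are illustrative assumptions.

    def parse_width(width_value):
        """Split a VSS-style width attribute into (attack, decay).

        A single value ("30%") is used for both attack and decay; two values
        ("35fr 10fr") give the attack and decay separately.
        """
        parts = width_value.split()
        if len(parts) == 1:
            return parts[0], parts[0]
        return parts[0], parts[1]

    print(parse_width("35fr 10fr"))  # ('35fr', '10fr')
    print(parse_width("30%"))        # ('30%', '30%')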

The amplitude of the curve will scale the pitch by percentage only. For example, a pitch amplitude of 100% has no effect on the pitch, while a pitch amplitude of 50% lowers the pitch by an octave, and a pitch amplitude of 200% raises the pitch by an octave.
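
This percentage behaves as a multiplicative factor on the fundamental frequency, which is why 50% and 200% correspond to one octave down and one octave up; a minimal illustration:

    def scale_pitch(f0_hz, amplitude_percent):
        """Scale a fundamental frequency by a VSS-style amplitude percentage."""
        return f0_hz * amplitude_percent / 100.0

    print(scale_pitch(220.0, 200))  # 440.0 -> one octave up
    print(scale_pitch(220.0, 50))   # 110.0 -> one octave down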

The sustain parameter controls the duration of time the curve holds at its peak amplitude. Regardless of the length of the sustain parameter, the width/attack/decay values stay the same, as shown in FIG. 9.

Voice identity may be used to differentiate individuals and includes voice traits that may modify the voice model as a whole and result in producing a final voice attributable to a distinctly identifiable person. By accessing and modifying the parameters that control voice identity traits, the user/customer may create a personalized voice for a specific system. Voice identity parameters include, for example, the vocal tract length, the lower and upper limits of pitch (pitch range), the overall duration, and the overall pause duration.

Raising the value of the vocal tract length parameter corresponds to increasing the length of the vocal tract of the speaker. A longer vocal tract results in a deeper sounding voice. Similarly, lowering the vocal tract length parameter corresponds to decreasing the vocal tract length of the speaker. This results in a higher pitched voice, for example, more like a cartoon character voice. In combination with the actual gender of the voice model, this may result in having a female voice model that sounds more male and vice versa.

By altering the general lower and upper limits of the voice pitch parameters, the generated pitch contours may be scaled within these limits. This results in changing the fundamental frequency of the voice and thus a part of its identity. The same voice can sound generally higher, lower or broader and thus change the perceived personality of the voice. This control may also be paired with the vocal tract length control for more realistic results.

The overall duration parameter controls the amount of time between the beginning and ending of an uttered phrase. Increasing the overall duration of the generated voice produces an effect of a more explanatory and calm voice, while decreasing the overall duration produces the effect of the speaker having a more active and energetic voice character. The overall pause duration parameter controls the amount of time between text components of a rendered phrase, for example, the time between words and/or the time between sentences or phrase portions. By manipulating the duration of all the generated pauses in the speech, both alone and in combination with the control of the overall duration, the voice is able to project a distinguishable and identifiable speaking style.

The above described parameters may be applied to a text phrase, for example, by parsing VSS. For example, a collection of VSS may include:

    // comment
    .pitch_song{
      speech_parameter: pitch;
      origin: 50%;
      width: 0fr 75fr;
      amplitude: 120%;
      sustain: 5fr;
    }
    // comment
    .destination{
      speech_parameter: duration;
      origin: 220fr;
      width: 50fr;
      amplitude: 90%;
      sustain: 20%;
    }
    // comment
    .artist{
      speech_parameter: pitch;
      origin: −2pau;
      width: 50fr 75fr;
      amplitude: 140%;
      sustain: 60fr;
    }

As the format of the VSS is well structured, a VSS file may be parsed by use of regular expressions to retrieve all the provided information. An example of text is:

    Text: Now playing, $SONG, by $ARTIST in the $DESTINATION.

An example of the text with descriptors is:

    Now playing, <song> $SONG </song>, by <artist> $ARTIST </artist> in the <destination> $DESTINATION </destination>.

By incorporating the descriptors into the generated text, information about where to apply a particular control may be extracted by using regular expressions.
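
As an illustration of this descriptor-based extraction, the sketch below pulls the tagged spans out of a generated sentence with a regular expression. The tag names follow the example above; the function name and the sample values filled in for $SONG, $ARTIST, and $DESTINATION are purely illustrative.

    import re

    def extract_descriptor_spans(text):
        """Return {tag: content} for <tag> ... </tag> descriptors in the text."""
        return {tag: content.strip()
                for tag, content in re.findall(r"<(\w+)>(.*?)</\1>", text)}

    text = ("Now playing, <song> Blue in Green </song>, by <artist> Miles Davis </artist> "
            "in the <destination> kitchen </destination>.")
    print(extract_descriptor_spans(text))
    # {'song': 'Blue in Green', 'artist': 'Miles Davis', 'destination': 'kitchen'}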

The given text may subsequently be converted into phonetic labels for parsing. The phonetic labels may be structured using specific linguistic features. For example:

-   StartTime EndTime PreviousPhoneme—CurrentPhoneme—NextPhoneme

As these phonetic labels have a very specific format, regular expressions may be used to parse them and retrieve the necessary information to successfully apply the VSS controls. Information extracted from the labels includes, but is not limited to:

-   Obtaining a starting time;
-   Obtaining an ending time;
-   Setting the position of a phoneme in the phrase;
-   Identifying a current phoneme;
-   Determining whether a given phoneme is a vowel; and
-   Determining a stable state of a phoneme.
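
A minimal sketch of parsing one label of the form shown above with a regular expression; the sample label text, the hyphen-separated phoneme context, the vowel set, and the function name are assumptions for illustration only.

    import re

    LABEL_PATTERN = re.compile(r"(\d+)\s+(\d+)\s+(\S+)-(\S+)-(\S+)")
    VOWELS = {"aa", "ae", "ah", "ao", "eh", "ey", "ih", "iy", "ow", "uw"}  # assumed subset

    def parse_label(label):
        """Extract timing and phoneme context from one phonetic label line."""
        start, end, prev_ph, cur_ph, next_ph = LABEL_PATTERN.match(label).groups()
        start, end = int(start), int(end)
        return {
            "start": start,
            "end": end,
            "phoneme": cur_ph,
            "is_vowel": cur_ph in VOWELS,
            "stable_frame": (start + end) // 2,  # rough middle of the phoneme
            "context": (prev_ph, cur_ph, next_ph),
        }

    print(parse_label("1050000 1400000 h-ey-b"))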

FIG. 6 is a flowchart 600 for an exemplary embodiment of a method for executing functionality of the present invention. It should be noted that any process descriptions or blocks in flowcharts should be understood as representing modules, segments, portions of code, or steps that include one or more instructions for implementing specific logical functions in the process, and alternative implementations are included within the scope of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

FIG. 6 is described using the embodiment of a text-to-speech scheduler 380 shown in FIG. 3. A voice ID is initialized with global VSS settings, as shown by block 610. A text and labels module 310 receives text 305 as input, as shown by block 620. The text and labels module analyzes the text 312 and generates phonetic labels describing the text 314, as shown by block 630. The text and labels module 310 stores the labels in a label buffer 320, for example, a circular buffer. A parameter generator module 330 accesses the label buffer 320 and generates duration parameters, as shown by block 640, among other possible parameters. The parameter generation module may include, for example, a context-dependent statistical model 332 for acoustic features and duration which is referenced for generation of parameters 334. The parameter generation module 330 may include a subsystem for controlling spectrum, pitch, aperiodicity, and vocal tract length durations 336. For example, a control interface 210 (FIG. 2) may interface with the parameter generation module 330 to display generated parameters and to provide an interface allowing for manipulation of the generated parameters in real time, for example, modifying durations with VSS, as shown by block 650, and generating acoustic features as shown by block 660 and modifying acoustic features as shown by block 670. The parameter generator module 330 stores the parameters in a parameter buffer 340, for example, a circular buffer.

An audio generation module 350 receives the parameters from the parameter buffer 340 and synthesizes audio samples based on the received text and the parameters, as shown by block 680. The audio samples may be grouped into segments, for example, according to sections of the received text, and stored in a sample buffer 360. An audio module 370 accesses the samples from the sample buffer 360 and renders audio. For example, the audio module 370 may include a digital-to-analog converter (DAC), an audio amplifier, and an audio transducer, such as a speaker.
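
The buffer-connected flow of FIG. 3 and FIG. 6 can be pictured as a few stages passing work through queues. The sketch below is a deliberately simplified, single-threaded rendering of that flow with hypothetical stage callables standing in for the text and labels, parameter generation, and audio generation modules; the scheduler of FIG. 3 would instead coordinate these stages concurrently using timestamps.

    from collections import deque

    def synthesize(text, text_to_labels, labels_to_params, params_to_samples):
        """Run text through label, parameter, and audio stages via simple buffers.

        The three stage callables stand in for modules 310, 330, and 350; here
        they run in order, whereas the scheduler 380 would coordinate them as
        concurrent, timestamped stages.
        """
        label_buffer = deque(text_to_labels(text))        # blocks 620/630
        param_buffer = deque()
        sample_buffer = deque()

        while label_buffer:                               # blocks 640-670
            param_buffer.append(labels_to_params(label_buffer.popleft()))
        while param_buffer:                               # block 680
            sample_buffer.extend(params_to_samples(param_buffer.popleft()))
        return list(sample_buffer)                        # handed to the audio module 370

    samples = synthesize(
        "Hey Bob, how are you?",
        text_to_labels=lambda t: t.split(),               # toy labels: one per word
        labels_to_params=lambda label: {"label": label},  # toy parameter set
        params_to_samples=lambda p: [0.0, 0.0],           # toy audio frames
    )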

While the embodiment shown in FIG. 6 indicates modification of duration and pitch, a similar methodology may be applied to vocal tract length, voice identity parameters, and text descriptors.

The VSS for voice identity may be implemented at the initialization of the system and thus may be considered as global settings of the voice. Any further VSS may be applied on top of these voice identity modifications.

The VSS for duration may be applied just after generating the durations from voice models, while acoustic related parameters, such as, but not limited to, pitch and vocal tract length VSS, may be applied just after generating acoustic features and just before vocoding (conversion into sound samples).

It is important to note here that the sequence of VSS application generally does matter, and commutative properties may not apply between pitch and durations. For example, the audible result may be different if a vowel duration is stretched before changing its pitch, rather than the pitch being altered before stretching the vowel duration. Although any number of VSS fragments with any possible style sequences may be supported, the approach used in the first embodiment is to first compute and apply the VSS to the durations, and then apply the VSS to the pitch. The reason for this is that once the durations are correctly set in place, the pitch controls are typically easier and more meaningful. Additionally, applying VSS to duration before pitch enables efficient support of a streaming architecture through intermediate generation steps, from label generation to final audio samples. The graphical interface of the control interface 210 provides the animator with visual feedback as well as audible feedback from the controls described and applied in the VSS. This interface may depict, for example, the result of the VSS on the pitch trajectory (pitch controls) and the number of samples (duration controls).

FIG. 10A shows the output of the graphical interface of the control interface 210 when no VSS is applied. At the very top, the synthesized text was “Hey Bob, how are you?” The generated waveform and the pitch curve respectively appear just underneath. Lastly, FIG. 10A indicates the phonemes and their duration in frames. This helps the animator to decide which controls to apply where, depending on the sculpting he/she is targeting.

FIG. 10B shows the output of the graphical interface of the control interface 210 when VSS that only contains pitch controls is applied. The effect that these controls have on the pitch curve is apparent when compared with the pitch curve in FIG. 10A. A dotted line represents the curve used to alter the generated pitch and produce a new sculpted pitch trajectory.

FIG. 10C shows the output of the graphical interface of the control interface 210 when VSS that only contains duration controls is applied. The effect that these controls have on the number of samples generated (duration) may be seen when compared with the pitch curve in FIG. 10A. A dotted line represents the curve used to alter the generated durations (number of samples), resulting in a new sculpted duration trajectory.

FIG. 10D shows the output of the graphical interface of the control interface 210 when VSS that contains both duration and pitch controls is applied. The effect that these controls have on both the number of samples generated (duration) and the pitch trajectory may be seen when compared with the one in FIG. 10A. Dotted lines represent the curves used to alter the generated durations (number of samples) and pitch, indicating new sculpted duration and pitch trajectories.

The present system for executing the functionality described in detail above may be a computer, an example of which is shown in the schematic diagram of FIG. 5. The system 500 contains a processor 502, a storage device 504, a memory 506 having software 508 stored therein that defines the abovementioned functionality, input and output (I/O) devices 510 (or peripherals), and a local bus, or local interface 512 allowing for communication within the system 500. The local interface 512 can be, for example but not limited to, one or more buses or other wired or wireless connections, as is known in the art. The local interface 512 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface 512 may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.

The processor 502 is a hardware device for executing software, particularly that stored in the memory 506. The processor 502 can be any custom made or commercially available single core or multi-core processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the present system 500, a semiconductor based microprocessor (in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions.

The memory 506 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.). Moreover, the memory 506 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 506 can have a distributed architecture, where various components are situated remotely from one another, but can be accessed by the processor 502. The software 508 defines functionality performed by the system 500, in accordance with the present invention. The software 508 in the memory 506 may include one or more separate programs, each of which contains an ordered listing of executable instructions for implementing logical functions of the system 500, as described below. The memory 506 may contain an operating system (O/S) 520. The operating system essentially controls the execution of programs within the system 500 and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.

The I/O devices 510 may include input devices, for example but not limited to, a keyboard, mouse, scanner, microphone, etc. Furthermore, the I/O devices 510 may also include output devices, for example but not limited to, a printer, display, etc. Finally, the I/O devices 510 may further include devices that communicate via both inputs and outputs, for instance but not limited to, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, or other device.

When the system 500 is in operation, the processor 502 is configured to execute the software 508 stored within the memory 506, to communicate data to and from the memory 506, and to generally control operations of the system 500 pursuant to the software 508, as explained above.

When the functionality of the system 500 is in operation, the processor 502 is configured to execute the software 508 stored within the memory 506, to communicate data to and from the memory 506, and to generally control operations of the system 500 pursuant to the software 508.

The operating system 520 is read by the processor 502, perhaps buffered within the processor 502, and then executed. When the system 500 is implemented in software 508, it should be noted that instructions for implementing the system 500 can be stored on any computer-readable medium for use by or in connection with any computer-related device, system, or method. Such a computer-readable medium may, in some embodiments, correspond to either or both the memory 506 or the storage device 504. In the context of this document, a computer-readable medium is an electronic, magnetic, optical, or other physical device or means that can contain or store a computer program for use by or in connection with a computer-related device, system, or method.

Instructions for implementing the system can be embodied in any computer-readable medium for use by or in connection with the processor or other such instruction execution system, apparatus, or device. Although the processor 502 has been mentioned by way of example, such instruction execution system, apparatus, or device may, in some embodiments, be any computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. In the context of this document, a “computer-readable medium” can be any means that can store, communicate, propagate, or transport the program for use by or in connection with the processor or other such instruction execution system, apparatus, or device.

Such a computer-readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a nonexhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic) having one or more wires, a portable computer diskette (magnetic), a random access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory) (electronic), an optical fiber (optical), and a portable compact disc read-only memory (CDROM) (optical). Note that the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

In an alternative embodiment, where the system 500 is implemented in hardware, the system 500 can be implemented with any or a combination of the following technologies, which are each well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), etc.

While the above description has generally described embodiments where the processing is performed in a single device or system, the methods are also applicable to distributed systems and/or devices. For example, an alternative embodiment may render speech in the cloud and send down rendered audio files to be played back on a local device. For example, one embodiment may provide local voice input and output by rendering TTS locally, while another embodiment may render TTS in the cloud and provide the resulting audio samples to the local device.

It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention.

What is claimed is:
 1. A control interface device configured to produce an output renderable by a speech synthesizer, comprising: a processor and a memory configured to store non-transient instructions for execution by the processor; a display unit; an input device configured to accept gestures and/or commands to manipulate a graphical object on the display unit, wherein, when executed by the processor, the instructions perform the steps of: receiving a text string; associating a voice parameter with a portion of the text string; and displaying by the display unit the graphical object comprising a representation of the text string and the voice parameter, wherein the voice parameter on the display unit is represented by a visible curve of voice parameter values plotted against frames, and wherein each respective phoneme in the text string is visually associated with particular ones of the frames; receiving via the input device a command to modify the voice parameter; and modifying the voice parameter according to the command.
 2. The device of claim 1, wherein the voice parameter comprises a prosody characteristic.
 3. The device of claim 1, wherein the voice parameter is bounded by a personality profile consisting of at least one of the group of a vocal tract length, a pitch range, a phrase duration, a pause duration.
 4. The device of claim 1, wherein modifying the voice parameter of the audio waveform is in accordance with a parameter range of an audio rendering device configured to render the audio waveform.

 5. The device of claim 1, further comprising the step of converting by the processor a gesture detected by the input device into the command.

 6. The device of claim 1, further comprising the step of associating a timestamp with the voice parameter.
 7. The device of claim 1, wherein the display and the input device comprise a touch screen configured to detect a single touch and/or multi-touch gesture.
 8. The device of claim 1, wherein the voice parameter comprises a markup symbol added to the text string to provide rendering instructions to the speech synthesizer.

 9. The device of claim 8, wherein the markup symbol indicates a value or range for one or more vocal parameters, selected from the group consisting of pitch, duration, amplitude, vocal tract dimension, sibilance, prosody width, and silence.
 10. The device of claim 9, wherein the markup symbol indicates the voice parameter is to be randomized to prevent repeated utterances from sounding identical, wherein a degree of randomness is specified by specifying a high and low range for the parameter's value.
 11. The device of claim 1, wherein the graphical object comprises an envelope controller.
 12. The device of claim 1, wherein the display is configured to present the text string, the voice parameter, and a second voice parameter as a trajectory.
 13. A method for controlling a voice animation for a text-to-speech synthesizer in real-time, comprising the steps of: initializing a voice identifier with a plurality of Voice Style Sheet (VSS) parameters formatted in a VSS file, wherein the VSS file comprises a text string for each respective one of the plurality of VSS parameters that identifies values for an origin, a width, an amplitude, and a sustain for that VSS parameter for rendering text-to-speech; receiving a text string; generating a plurality of phonetic labels for a rendering of the text string; receiving an input indicating a modification to the plurality of VSS parameters; modifying the plurality of VSS parameters according to the modification; and generating audio samples according to the plurality of modified VSS parameters.
 14. The method of claim 13, wherein the modification refers to a duration of a portion of the voice animation.
 15. The method of claim 13, wherein the modification refers to an acoustic feature of a portion of the voice animation.
 16. The method of claim 13, further comprising the step of assigning a timestamp to a voice parameter of the plurality of VSS parameters.
 17. A text-to-speech system comprising: a text and labels module configured to receive a text input and provide a text analysis and a label comprising a phonetic description of the text; a label buffer configured to receive the label from the text and labels module; a parameter generation module configured to access the label from the label buffer and generate a speech generation parameter; a parameter buffer configured to receive the parameter from the parameter generation module; an audio generation module configured to receive the text input, the label, and/or the parameter and generate a plurality of audio samples; and a scheduler configured to monitor and schedule at least one of the group consisting of the text and label module, the parameter generation module, and the audio generation module; wherein the parameter generation module is further configured to perform the steps of: initializing a voice identifier with a Voice Style Sheet (VSS) parameter formatted in a VSS file, wherein the VSS file comprises a text string for each respective one of the plurality of VSS parameters that identifies values for an origin, a width, an amplitude, and a sustain for that VSS parameter for rendering text-to-speech; receiving an input indicating a modification to the VSS parameter; and modifying the VSS parameter according to the modification.
 18. The system of claim 17, further comprising a control interface configured to display voice animation control data and provide an interface to receive real-time input to manipulate the animation control data.
 19. The system of claim 17, wherein the audio generation module further comprises a text-to-speech (TTS) playback device configured to receive input comprising text and formatted control data for rendering by an audio transducer in real-time.
 20. The system of claim 17, wherein the plurality of audio samples comprises a speech synthesis of the text input.
 21. The device of claim 19, wherein the audio generation module further comprises an audio transducer.
 22. The device of claim 17, further comprising a sample buffer configured to receive the plurality of samples from the audio generation module.

 23. The device of claim 1, wherein the memory stores a plurality of Voice Style Sheet (VSS) parameters formatted in a VSS file, wherein the VSS file comprises a text string for each respective one of the plurality of VSS parameters that identifies values for an origin, a width, an amplitude, and a sustain for that VSS parameter for rendering text-to-speech.
 24. The method of claim 13, further comprising: displaying by a display unit, a graphical object comprising a representation of the text string and one of the VSS parameters, wherein the VSS parameter on the display unit is represented by a visible curve of voice parameter values plotted against frames, and wherein each respective phoneme in the text string is visually associated with particular ones of the frames.
 25. The system of claim 17, further comprising: a display unit configured to display a graphical object comprising a representation of the text string and one of the plurality of VSS parameters, wherein the VSS parameter on the display unit is represented by a visible curve of the VSS parameter's values plotted against frames, and wherein each respective phoneme in the input text is visually associated with particular ones of the frames.
 26. The device of claim 9, wherein the markup symbol indicates the voice parameter is to be randomized to prevent repeated utterances from sounding identical, wherein a degree of randomness is specified by specifying a probability that the parameter adjustment will be applied during a current rendering.
 27. A computer-implemented method of statistical parametric speech synthesis, the method comprising: analyzing a text string with a text analyzer to produce phonetic labels lexically and phonetically describing the text string; using the phonetic labels to access context dependent models for acoustic features and duration; generating parameters from the context dependent models for all of the phonetic labels; translating controls to a voice style sheet format identifying parameters including pitch (f0), spectrum, duration, vocal tract length, and aperiodicity per frame for all the phonetic labels; providing a control interface for real-time manipulation of the parameters; synthesizing a set of audio samples with a vocoder based on the parameters and the real-time manipulation of the parameters at the control interface to produce a synthesized speech waveform of the text string; and rendering the synthesized speech with a rendering system.